-
Notifications
You must be signed in to change notification settings - Fork 0
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add HPO<->UMLS and HPO<->MeSH mapping to MedGen release #18
Conversation
CC @souzadevinicius Just adding some SSSOM outputs for the MedGen ingest. |
fe3fecb
to
4ac6825
Compare
src/utils.py
Outdated
os.remove(temp_filtered_config_path) | ||
|
||
|
||
def write_tsv_with_comments(df: pd.DataFrame, comments_file: Union[Path, str], outpath: Union[Path, str]): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
General: write_tsv_with_comments()
@matentzn Let me know what the existing best practice is supposed to be for this.
I have this function (and several others it's used with) in the repos for GARD, MedGen, ICD11, a TIMS repo, and possibly elsewhere. Maybe there's a function in sssom-py
that does this like I want? I have a dataframe and a path to a sssom-metadata.yml
, and I want to write that to a file and include the comment at the top.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yep,
- instantiate a MappingSetDataFrame() object with the dataframe
- use
write_table()
method to frite the result
See example: https://github.com/mapping-commons/sssom-py/blob/be8a77b5a2ef42cd2b92b74d892d7626dec27471/src/sssom/cli.py#L280
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I recall this a bit now, thanks. Will do.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done~
Now my write_sssom()
function has been rewritten to utilize sssom-py
:
def write_sssom(df: pd.DataFrame, config_path: Union[Path, str], outpath: Union[Path, str]):
"""Writes a SSSOM file"""
with open(config_path, 'r') as yaml_file:
metadata: Dict = yaml.load(yaml_file, Loader=yaml.FullLoader)
converter = curies.Converter.from_prefix_map(metadata['curie_map'])
msdf: MappingSetDataFrame = MappingSetDataFrame(converter=converter, df=df, metadata=metadata)
with open(outpath, 'w') as f:
write_table(msdf, f)
As with my icd11 PR which had the same refactor, while working on this I also encountered some possible bugs and UX issues, and made:
MappingSetDataFrame
issues when noconverter
passed mapping-commons/sssom-py#513- UX: Extra entries in
curie_map
of written TSV mapping-commons/sssom-py#514 - UX: Second empty line break at bottom of file mapping-commons/sssom-py#515
- Suggestion:
write_table
: pass outpath alternative toTextIO
mapping-commons/sssom-py#516 - Suggestion:
MappingSetDataFrame
:.to_csv()
andread_csv()
mapping-commons/sssom-py#517
@@ -39,3 +39,5 @@ jobs: | |||
output/release/medgen.obo | |||
output/release/medgen-disease-extract.obo | |||
output/release/medgen-xrefs.robot.template.tsv | |||
output/release/hpo-umls.sssom.tsv | |||
output/release/hpo-mesh.sssom.tsv |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
SSSOM Outputs
I ran a new build and created a release which includes these. Here are some sample rows.
Questions
- Double checking that you 100% want only UMLS mappings
Should MedGen-derived prefixes just be "UMLS", or one of ("UMLS", "MEDGENCUI", or "MEDGEN") if/where applicable?
UMLS: Compose 100% of the subjects inhpo-umls.sssom.tsv
currently.
MEDGENCUI: Note that there were no MEDGENCUI<->HPO mappings in this set inMedGenIDMappings.txt
. Also note that I did see your instruction in Finalise MedGen xref table #15: Remove all rows with MEDGENCUI in it (we only need xrefs to MEDGEN:123 and UMLS:123) (please see & respond to my related question in that issue). However, applied our previous logic discussed in "Refactoring > Prefix assignment", and verified afterwards that there was no MEDGENCUI.
MEDGEN: These are IDs that have neither CN or C at the beginning; they are UIDs. I did not check to see if we could actually get HPO<->MEDGEN mappings. Do you want them? - Subject/object order & filename
I think I renamedhpo-umls.sssom.tsv
incorrectly. Just realizing this now. I have the subjects as UMLS. But HPO comes first in the filename; so perhaps I should make it so that the position is consistent? Does it matter to you which is the subject and which is the object? - Are my columns good?
I could includemapping_justification
, but I don't know their process, and I don't know if we can say if there is or isn't variability in how the do their mappings.
hpo-umls.sssom.tsv
subject_id | subject_label | predicate_id | object_id |
---|---|---|---|
UMLS:C0000727 | Acute abdomen | skos:exactMatch | HP:0033400 |
UMLS:C0000729 | Abdominal cramps | skos:exactMatch | HP:0032155 |
UMLS:C0000731 | Abdominal distention | skos:exactMatch | HP:0003270 |
hpo-mesh.sssom.tsv
subject_id | predicate_id | object_id |
---|---|---|
HP:0000003 | skos:exactMatch | MESH:D021782 |
HP:0000005 | skos:exactMatch | MESH:D040582 |
HP:0000011 | skos:exactMatch | MESH:D001750 |
This one doesn't have subject_label
or object_label
because I couldn't guarantee that the label in MedGenIDMappings.txt
was an accurate reflection of either source in 100% of cases. I imagine these labels are coming either from UMLS or from MedGen.
hpo-mesh-no-matches.sssom.tsv
hpo-mesh-no-matches.sssom.tsv.zip
subject_id | predicate_id | object_id | umls_id | umls_label |
---|---|---|---|---|
skos:exactMatch | MESH:C000591739 | UMLS:C1970109 | Aromatase excess syndrome | |
skos:exactMatch | MESH:C000596385 | UMLS:C3495589 | Jalili syndrome | |
skos:exactMatch | MESH:C000597569 | UMLS:C4042185 | Teratoid Rhabdoid Tumor |
Not included in the release. This is for analysis. I just wanted you to see the cases where there were no matches. There were only about 2,300 HPO<->MeSH mappings that could be derived out of the ~16,000 MeSH terms that have UMLS mappings. I included umls_id
and umls_label
in this set for review. It also includes all of the rows for all of the matches; not just the non-matches. I sorted the non-matches to the top.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Double checking that you 100% want only UMLS mappings
For now yes. Maybe in the future (but not now) we will look for medgen mappings as well, but for now, we only want HPO-UMLS mappings.
- @joeflack4 Ensure this is done (probably already so)
I think I renamed hpo-umls.sssom.tsv incorrectly. Just realizing this now. I have the subjects as UMLS. But HPO comes first in the filename; so perhaps I should make it so that the position is consistent? Does it matter to you which is the subject and which is the object?
Its cosmetic, ask your own inner sense of style. I like the order, but wont insist on it if it takes a lot of time to fix
- @joeflack4 flip the order in the file name
I could include mapping_justification, but I don't know their process, and I don't know if we can say if there is or isn't variability in how the do their mappings.
Please use sssom toolkit as you want to release a valid sssom file. mapping_justification
is mandatory. Since its MedGen
I would tend to use semapv:ManualMappingCuration
.
- @joeflack4 add
mapping_justification
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I edited your comment w/ checkboxes. Will do.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
All done!
91e9e37
to
bff4784
Compare
- Add: Code for 2 new outputs: hpo-umls.sssom.tsv and hpo-mesh.sssom.tsv General - Add: Python requirement: PyYaml - Update: ORCID in SSSOM config - Bug fix: Edge case for missing intermediate file in pipeline - Remove: Intentionally duplicative, no longer needed 'MEDGENCUI' xrefs - Add: Refactor to load mapping set using function that handles lots of repetitive stuff: get_mapping_set()
- Update: Utilizing standard sssom-py functionality - Add: sssom validate - Add: mapping_justification - Update: Subject/object order & filename - Add: Mandatory filtering of UMLS matches only in HPO-UMLS SSSOM.
#medgen.sssom.tsv: medgen.obographs.json | ||
# sssom parse medgen.obographs.json -I obographs-json -m config/medgen.sssom-metadata.yml -o $@ | ||
sssom: umls-hpo.sssom.tsv | ||
sssom validate umls-hpo.sssom.tsv |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
SSSOM validation
As with the ICD11 PR, I also added SSSOM validation here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I did not do a detailed code review, but what I saw looks good! Thanks!
addresses #16
Changes